Intro

Missing Data Structure

Before we move further with imputation of missing values it’s important to understand the overall structure of data and the missing values present within that.

Visualising Missing Data for SSA

Load the required packages.

library(tidyverse) # general data wizadry and manipulation
library(VIM) # visualise missing data patterns
library(naniar) # data structure visualisation in a 'tidy' manner
library(visdat) # accompanies the naniar package

# set environment
rm(list=ls()) # remove all objects
set.seed(12) # set the random seed to 12 so results are reproducible

First we need to read in the required data from a GitHub repositary. For this methods demonstration we will just use the BSQ subportion of the Carpinteria foodweb as an example. The process is exactly the same for all foodwebs.

bsq <- read.csv("https://raw.githubusercontent.com/SmithD19/FoodWeb/master/data/interactionwebdb/Carpinteria/BSQweb_Nodes.csv") # load data from GitHub repo - simple

A plot below using the visdat library allows us to visualise this further and gives more information about overall data structure and the type of data.

# Overall data structure and missing values using visdat library
vis_dat(bsq) +
  labs(title = "Structure and NA values of BSQ data")

We can clearly see the NA values in the data frame here marked by the grey blocking. This also makes it easy for us to select only the values we care about for imputation, the trait values and give them better, easier to write names in a new and less cluttered data frame.

bsq_wrk <- 
  bsq %>% # select and rename some variables
  dplyr::select(FunctionalGroup = ConsumerStrategy.stage., BodySize = BodySize.g., 
                Biomass = Biomass.kg.ha., Abundance = Abundance.no..ha.) %>% 
  mutate(BodySize = BodySize / 1000) # change BodySize from g to Kg so that units are equal

# take a look at the new dataset
glimpse(bsq_wrk)
## Observations: 290
## Variables: 4
## $ FunctionalGroup <fct> detritus, detritus, detritus, autotroph, autot...
## $ BodySize        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Biomass         <dbl> 3.779020e-02, 1.080667e+02, 1.080667e+03, 7.31...
## $ Abundance       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

Using the handy VIM package we can quickly visualise the missing data present in the original data set - in particular the trait values of the nodes are what we’re interested in for now.

aggr(bsq_wrk[,2:4], col=c('navyblue','red'), 
                 numbers=TRUE, sortVars=TRUE, labels=names(bsq_wrk[, 2:4]), 
                 cex.axis=.7, gap=3, ylab=c("Missing Data Histogram for BSQ","Pattern of Missing Data for BSQ"))

Using the equation \(Biomass = BodySize * Abundance\) we can conditionally fill in variables that are missing from data points using mutate() and case_when(). Below is the workflow used for conditionally replacing this data based on the equation and reassembling the data back to it’s original structure but containing the new values inferred from the equation above.

bsq_new <- 
  bsq_wrk %>% 
  # if BodySize = NA & if Biomass & Abundance != NA then BodySize = Biomass/Abundance
  mutate(BodySizeNew = case_when(!is.na(Abundance & Biomass) ~ Biomass / Abundance),
         # if Abundance = NA and BodySize & Biomass != NA then Abundance = Biomass/BodySize 
         AbundanceNew = case_when(!is.na(BodySize & Biomass) ~ Biomass / BodySize),
         # if Biomass = NA and BodySize & Abundance != NA then Biomass = Abundance * BodySize
         BiomassNew = case_when(!is.na(Abundance & BodySize) ~ Abundance * BodySize)) %>% 
  # join the columns of new and old together use coalesce -- Biomass + BiomassNew...
  # this selects the original value first so only values that are missing from original dataset and then computed are selected  
  mutate(BodySize_Work = coalesce(BodySize, BodySizeNew),
         Biomass_Work = coalesce(Biomass, BiomassNew),
         Abundance_Work = coalesce(Abundance, AbundanceNew)) %>% 
  # now dplyr::select and rename the new working colums to replace the old incomplete data set
  dplyr::select(FunctionalGroup, BodySize = BodySize_Work,
                Biomass = Biomass_Work, Abundance = Abundance_Work)

Now with the extra values we added from the inferred relationship between BodySize, Biomass and Abundance we can look at how much data we have to work with when getting ready for the imputation process.

First lets take a look at another plot using the VIM aggr plotting function.

aggr(bsq_new[,2:4], col=c('navyblue','red'),
     numbers=TRUE, sortVars=TRUE, labels=names(bsq_new[, 2:4]),
     cex.axis=.7, gap=3, ylab=c("Histogram of missing data for BSQ","Pattern of missing data for BSQ"))

Now it looks like we have 75% of the data with no missing values for any of that traits we’re interested in. This is significantly better than the 50% we had earlier and gives us a better starting point for the imputation process using the mice package, which is shorthand for Multivariate Imputation using Chained Equations.

Imputation

For imputation we have the choice of several packages in the CRAN repository. For now the most appropriate package is that of mice as shown by penone2014, without phylogenetic data this method provides the best imputation results, particularly for body size data and traits highly correlated to this.

library(mice) # R package for Multivariate Imputation using Chained Equations
# Be aware that the MASS package attatched to mice masks the select() function by dplyr
# It is therefore necessry to specifically call dplyr::select() to use this function

Dealing with co-linearity

Lets create a new data frame for the imputation process. It’s important in the imputation process to log transform values that could be correlated with each otehr, this process reduces the co-linearity of the values and therefore makes imputation more accurate.

bsq_imp <- bsq_new %>% dplyr::select(BodySize, Biomass, Abundance) %>% 
  ##  log transform the variables you want to impute, addition of 1 solves problems 
  ##  with infinite results due to logging values below 0
  mutate(BodySize = log(BodySize + 1), Biomass = log(Biomass + 1), Abundance = log(Abundance + 1))

Setup of mice()

The code below is standard setup and can be used for any imputation in mice, just change the dataset name.

init = mice(bsq_imp, maxit=0) 
init # An overview of the mice object you just created using the code above and your dataset
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##  BodySize   Biomass Abundance 
##     "pmm"     "pmm"     "pmm" 
## PredictorMatrix:
##           BodySize Biomass Abundance
## BodySize         0       1         1
## Biomass          1       0         1
## Abundance        1       1         0

Before we run the imputation we want to check which values mice is using for the imputation, by default this is all the variables you provide it with, this may not be the best approach as we only need BodySize and Abundance variables for our future analysis, so using the Biomass variable as a predictor that won’t be imputed but is still used in the imputation process is probably correct, but for this example we will continue with all variables being imputed and for imputation prediction but the example code of how to look at what mice is using will be shown below.

meth = init$method
init$method # The method used for the imputation of each variable
##  BodySize   Biomass Abundance 
##     "pmm"     "pmm"     "pmm"
predM = init$predictorMatrix
init$predictorMatrix # The prediction matrix for your imputed dataset
##           BodySize Biomass Abundance
## BodySize         0       1         1
## Biomass          1       0         1
## Abundance        1       1         0

mice() usage

Using the imputation function mice() we can now impute the missing data. The m value corresponds to the number of data sets imputed and for this example we will use 10 as it’s easier to show, but for real analysis more appropriate iterations of imputation are suggested (50+). The data will be imputed using the pmm method which is short for predictive mean matching. A list of methods for imputation using mice are listed below.

bsq_imp <- mice(bsq_imp, m = 10, 
                method=meth, # using both the methods and predictor matrix we specified earlier
                predictorMatrix=predM) # we didnt change any values in the predictor matrix so this isn't entirely necessary 
# the number of imputation methods are available by calling this command
methods(mice)
##  [1] mice.impute.2l.bin       mice.impute.2l.lmer     
##  [3] mice.impute.2l.norm      mice.impute.2l.pan      
##  [5] mice.impute.2lonly.mean  mice.impute.2lonly.norm 
##  [7] mice.impute.2lonly.pmm   mice.impute.cart        
##  [9] mice.impute.jomoImpute   mice.impute.lda         
## [11] mice.impute.logreg       mice.impute.logreg.boot 
## [13] mice.impute.mean         mice.impute.midastouch  
## [15] mice.impute.norm         mice.impute.norm.boot   
## [17] mice.impute.norm.nob     mice.impute.norm.predict
## [19] mice.impute.panImpute    mice.impute.passive     
## [21] mice.impute.pmm          mice.impute.polr        
## [23] mice.impute.polyreg      mice.impute.quadratic   
## [25] mice.impute.rf           mice.impute.ri          
## [27] mice.impute.sample       mice.mids               
## [29] mice.theme              
## see '?methods' for accessing help and source code

Visual analysis of imputed data

We can look at some of the imputed values and compare them to our original data set using some convenient functions provided in the mice package.

Density plots

This density plot shows us the density of imputed values compared to our original input data (red and blue, respectively), assuming that BodySize is proportional to Abundance and Biomass.

densityplot(bsq_imp)

Strip plots

A strip plot here shows the imputed values of each value compared tothe original data set for each imputed data set (red and blue, respectively). It’s much easier here to see the variance in each imputed data set. Dealing with this variance is important and we will be using some functions present in the mice package to coalesce all this data together, and then to test the significance of it compared to the original data.

stripplot(bsq_imp, pch = 20, cex = 1.2)

To complete the imputation process the complete() function does all the hard work for you, turning your mice object back into a data frame that contains all the imputed values inputted into it, giving you a complete data frame. There are various ways to complete this and even add multiple imputed data frames into one large data frame or a list.

bsq_complete <- complete(bsq_imp, "all") # use the all the imputed data sets
bsq_complete <- complete(bsq_imp, 1) # use the first imputed data set
glimpse(bsq_complete)
## Observations: 290
## Variables: 3
## $ BodySize  <dbl> 3.833558e-01, 6.565608e-06, 3.740000e-09, 3.370000e-...
## $ Biomass   <dbl> 0.03709364, 4.69195935, 6.98625837, 8.89841114, 6.64...
## $ Abundance <dbl> 0.09531018, 10.33295268, 15.06015911, 8.34503026, 9....
summary(bsq_complete)
##     BodySize            Biomass            Abundance     
##  Min.   :0.0000000   Min.   : 0.000000   Min.   : 0.000  
##  1st Qu.:0.0000000   1st Qu.: 0.000283   1st Qu.: 4.331  
##  Median :0.0000027   Median : 0.016625   Median : 9.347  
##  Mean   :0.0864769   Mean   : 0.672503   Mean   : 8.433  
##  3rd Qu.:0.0007833   3rd Qu.: 0.265289   3rd Qu.:12.256  
##  Max.   :1.6292405   Max.   :12.333661   Max.   :22.597

Alternative approaches and accuracy

There may be more appropriate methods of imputation that take adavantage of evolutionary models or phylogeneteic distances that allow constraining of imputed values depending upon the values of their closest evolutionary relatives that may have data already present or imputed. Phylogenetic data is incomplete for my datasets but the majority of data points have taxa at least to the phylum level and below. With the most speciose phyla compromising the majority of data points present.

# phylogeny count plot
plot.freq <- 
  bsq %>%
  # count factor levels and frequency
  count(Phylum) %>% 
  # rename blank factor varables to other
  mutate(Phylum = fct_recode(Phylum, "Other" = "")) %>% 
  # plot
  ggplot(aes(x = reorder(Phylum, -n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  ggtitle("Frequency plot of Phylum in BSQ") +
  xlab("Phylum") +
  ylab("Count") +
  coord_flip()
plot.freq

We can also visualise what levels of data are missing in relation to phylogenetic grouping using the naniar package and the gg_miss_fct() function that comes with this. It’s also useful to see how data is missing with regards to the functional grouping of that taxa in the data set. Combining these two extra variables we can easily see that clades and phyla typically containing smaller animals are more likely to have missing data points.

# several useful ways to visualise this missing data by functional and taxanomic grouping
plot.missing.group <- 
  gg_miss_fct(x = bsq_new, fct = FunctionalGroup) + 
    labs(title = "% NA Values by Functional Group for BSQ")

plot.missing.phylo <- 
  bsq %>% 
  dplyr::select(BodySize = BodySize.g., Biomass = Biomass.kg.ha., 
                Abundance = Abundance.no..ha., Phylum) %>% 
  gg_miss_fct(fct = Phylum) + 
  labs(title = "% NA Values by Phylogeny Group for BSQ")